hpgltools was written to make working with high-throughput data analyses easier. These analyses generally fall into a few stages:
Before any of these tasks may be performed, the data must be loaded into memory. hpgltools attempts to make this easier with create_expt() and subset_expt().
The following examples will use a real data set from a recent experiment in our lab. The raw data was processed using a mix of trimmomatic, biopieces, bowtie, samtools, and htseq. The final count tables were deposited into the ‘preprocessing/count_tables/’ tree. The resulting data structure was named ‘most_v0M1,’ named because it is comprised of count tables with 0 mismatches and 1 randomly-placed multi-match.
The annotation file was mgas_5005.gff.xz resides in ‘reference/gff/’.
The count tables and meta-data were loaded through the create_expt() function and the genome annotations were loaded with gff2df().
library(hpgltools)
data_file <- system.file("cdm_expt.rda", package="hpgltools")
cdm <- new.env()
load(data_file, envir=cdm)
rm(data_file)
ls()
## [1] "cdm" "hh" "ht" "rmd_file" "setnicepar"
Two variables should exist now: rmd_file in case I want to knitr this file, cdm which is a list including the data required to make an expressionset. Using this information, I can create an expressionset.
expt <- create_expt(count_dataframe=cdm$cdm_counts, metadata=cdm$cdm_metadata, gene_info=cdm$gene_info)
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
knitr::kable(head(expt$design))
| sampleid | type | stage | replicate | mutantname | media | exptdate | libdate | batch | condition | readspassed | ncrna | xncrna | remaining | genome | xgenome | otherbacterial | xother | colors | counts | intercounts | design | file | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HPGL0418 | HPGL0418 | WT | LL | 1 | 5448-1 | CDMg | 20130430 | 20140409 | a | wt_ll_cg | 17425675 | 350440 | 2.01% | 17075235 | 16061860 | 94.07% | 356542 | 2.09% | #E12C8C | processed_data/count_tables/HPGL0418/HPGL0418_forward-trimmed-v1M1l20.count.xz | unknown | a | null |
| HPGL0419 | HPGL0419 | WT | LL | 2 | 5448-2 | CDMg | 20130430 | 20140409 | b | wt_ll_cg | 18106361 | 398984 | 2.20% | 17707377 | 16619183 | 93.85% | 392803 | 2.22% | #E12C8C | processed_data/count_tables/HPGL0419/HPGL0419_forward-trimmed-v1M1l20.count.xz | unknown | b | null |
| HPGL0420 | HPGL0420 | mga | LL | 1 | Mga-1 | CDMg | 20130430 | 20140409 | a | mga1_ll_cg | 20499936 | 417440 | 2.04% | 20082496 | 18062942 | 89.94% | 416260 | 2.07% | #BE5067 | processed_data/count_tables/HPGL0420/HPGL0420_forward-trimmed-v1M1l20.count.xz | unknown | a | null |
| HPGL0421 | HPGL0421 | mga | LL | 2 | Mga-2 | CDMg | 20130430 | 20140409 | b | mga1_ll_cg | 17216381 | 397021 | 2.31% | 16819360 | 14795703 | 87.97% | 376295 | 2.24% | #BE5067 | processed_data/count_tables/HPGL0421/HPGL0421_forward-trimmed-v1M1l20.count.xz | unknown | b | null |
| HPGL0422 | HPGL0422 | WT | LL | 1 | 5448-1 | CDMf | 20130430 | 20140415 | a | wt_ll_cf | 17112706 | 367791 | 2.15% | 16744915 | 15114930 | 90.27% | 367974 | 2.20% | #8E7E40 | processed_data/count_tables/HPGL0422/HPGL0422_forward-trimmed-v1M1l20.count.xz | unknown | a | null |
| HPGL0423 | HPGL0423 | WT | LL | 2 | 5448-2 | CDMf | 20130430 | 20140415 | b | wt_ll_cf | 18782729 | 395748 | 2.11% | 18386981 | 17330239 | 94.25% | 386443 | 2.10% | #8E7E40 | processed_data/count_tables/HPGL0423/HPGL0423_forward-trimmed-v1M1l20.count.xz | unknown | b | null |
summary(expt)
## Length Class Mode
## title 1 -none- character
## notes 1 -none- character
## initial_metadata 23 data.frame list
## expressionset 1 ExpressionSet S4
## design 23 data.frame list
## conditions 8 -none- character
## batches 8 -none- character
## samplenames 8 -none- character
## colors 8 -none- character
## state 5 -none- list
## original_expressionset 1 ExpressionSet S4
## original_libsize 8 -none- numeric
## libsize 8 -none- numeric
## original_metadata 1 AnnotatedDataFrame S4
The data structure generated by create_expt() is a list containing the following slots:
The primary reason I created the expt object was that I did not fully understand how expressionSets worked. From my perspective, the documentation was rather opaque and therefore I decided to create a simpler version of the expressionset. As I learned how to manipulate S4 classes more efficiently, I gradually began to realize that they are not so bad. Nonetheless, I found it nice to be able to keep some extra state information with my expressionsets, along with a backup copy in case of mistakes, and some easier ways to extract information like conditions/batches/colors/etc. Thus my *expt() functions grew into wrappers to call the native functions for manipulating expressionsets.
With the above in mind, once we have the various metadata and count data loaded into memory, a most common next step is examine the data before molesting it:
raw_metrics <- sm(graph_metrics(expt, qq=TRUE))
The function graph_metrics() performs all of the likely plots one might want. Some of which are not really appropriate for non-normalized data unless it is incredibly well behaved (after 30 years, I still want to spell behaved ‘behaived’, why is that?).
## View a raw library size plot
raw_metrics$libsize
## Or boxplot to see the data distribution
raw_metrics$boxplot
## The warning is because it automatically uses a log scale and there are some 0 count genes.
## Perhaps you prefer density plots
raw_metrics$density
## quantile/quantile plots compared to the median of all samples
raw_metrics$qqrat
## Here we can see some samples are differently 'shaped' compared to the median than others
## There are other plots one may view, but this data set is a bit too crowded as is.
## The following summary shows the other available plots:
summary(raw_metrics)
## Length Class Mode
## nonzero 9 gg list
## libsize 9 gg list
## boxplot 9 gg list
## corheat 3 recordedplot list
## smc 9 gg list
## disheat 3 recordedplot list
## smd 9 gg list
## pcaplot 9 gg list
## pcatable 8 data.frame list
## pcares 4 data.frame list
## pcavar 7 -none- numeric
## tsneplot 9 gg list
## tsnetable 8 data.frame list
## tsneres 4 -none- list
## tsnevar 8 -none- numeric
## density 9 gg list
## legend 2 -none- list
## qqlog 3 recordedplot list
## qqrat 3 recordedplot list
## ma 0 -none- NULL
The plots are all generated by calling plot_something() where the somethings are:
On the other hand, we might take a subset of the data to focus on the late-log vs. early-log samples.
The expt_subset() function allows one to pull material from the experimental design.
Once we have a smaller data set, we can more easily use PCA to see how the sample separate.
head(expt$design)
## sampleid type stage replicate mutantname media exptdate libdate
## HPGL0418 HPGL0418 WT LL 1 5448-1 CDMg 20130430 20140409
## HPGL0419 HPGL0419 WT LL 2 5448-2 CDMg 20130430 20140409
## HPGL0420 HPGL0420 mga LL 1 Mga-1 CDMg 20130430 20140409
## HPGL0421 HPGL0421 mga LL 2 Mga-2 CDMg 20130430 20140409
## HPGL0422 HPGL0422 WT LL 1 5448-1 CDMf 20130430 20140415
## HPGL0423 HPGL0423 WT LL 2 5448-2 CDMf 20130430 20140415
## batch condition readspassed ncrna xncrna remaining genome
## HPGL0418 a wt_ll_cg 17425675 350440 2.01% 17075235 16061860
## HPGL0419 b wt_ll_cg 18106361 398984 2.20% 17707377 16619183
## HPGL0420 a mga1_ll_cg 20499936 417440 2.04% 20082496 18062942
## HPGL0421 b mga1_ll_cg 17216381 397021 2.31% 16819360 14795703
## HPGL0422 a wt_ll_cf 17112706 367791 2.15% 16744915 15114930
## HPGL0423 b wt_ll_cf 18782729 395748 2.11% 18386981 17330239
## xgenome otherbacterial xother colors
## HPGL0418 94.07% 356542 2.09% #E12C8C
## HPGL0419 93.85% 392803 2.22% #E12C8C
## HPGL0420 89.94% 416260 2.07% #BE5067
## HPGL0421 87.97% 376295 2.24% #BE5067
## HPGL0422 90.27% 367974 2.20% #8E7E40
## HPGL0423 94.25% 386443 2.10% #8E7E40
## counts
## HPGL0418 processed_data/count_tables/HPGL0418/HPGL0418_forward-trimmed-v1M1l20.count.xz
## HPGL0419 processed_data/count_tables/HPGL0419/HPGL0419_forward-trimmed-v1M1l20.count.xz
## HPGL0420 processed_data/count_tables/HPGL0420/HPGL0420_forward-trimmed-v1M1l20.count.xz
## HPGL0421 processed_data/count_tables/HPGL0421/HPGL0421_forward-trimmed-v1M1l20.count.xz
## HPGL0422 processed_data/count_tables/HPGL0422/HPGL0422_forward-trimmed-v1M1l20.count.xz
## HPGL0423 processed_data/count_tables/HPGL0423/HPGL0423_forward-trimmed-v1M1l20.count.xz
## intercounts design file
## HPGL0418 unknown a null
## HPGL0419 unknown b null
## HPGL0420 unknown a null
## HPGL0421 unknown b null
## HPGL0422 unknown a null
## HPGL0423 unknown b null
## elt stands for: "early/late in thy"
batch_a <- expt_subset(expt, subset="batch=='a'")
batch_b <- expt_subset(expt, subset="batch=='b'")
a_metrics <- graph_metrics(batch_a)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 475 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## There is just one batch in this data.
## Graphing a T-SNE plot.
## There is just one batch in this data.
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 475 zero count features.
## Printing a color to condition legend.
b_metrics <- graph_metrics(batch_b)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 448 zero count features.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## There is just one batch in this data.
## Graphing a T-SNE plot.
## There is just one batch in this data.
## Plotting a density plot.
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0. We are on log scale, setting them to 0.5.
## Changed 448 zero count features.
## Printing a color to condition legend.
a_metrics$pcaplot
b_metrics$pcaplot
a_metrics$tsneplot
b_metrics$tsneplot
It is pretty obvious that the raw data is a bit jumbled according to PCA. This is not paricularly suprising since we didn’t normalize it at all. Therefore, in this block I will normalize it a few ways and follow up with some visualizations of showing how the apparent relationships change in the data.
## doing nothing to the data except log2 transforming it has a surprisingly large effect
norm_test <- normalize_expt(expt, transform="log2")
## This function will replace the expt$expressionset slot with:
## log2(data)
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Filter is false, this should likely be set to something, good
## choices include cbcb, kofa, pofa (anything but FALSE). If you want this to
## stay FALSE, keep in mind that if other normalizations are performed, then the
## resulting libsizes are likely to be strange (potentially negative!)
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: not doing count filtering.
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: transforming the data with log2.
## transform_counts: Found 923 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
l2_metrics <- sm(graph_metrics(norm_test))
## a quantile normalization alone affect some, but not all of the data
norm_test <- sm(normalize_expt(expt, norm="quant"))
q_metrics <- sm(graph_metrics(norm_test)) ## q for quant, oh oh oh!
## cpm alone brings out some samples, too
norm_test <- sm(normalize_expt(expt, convert="cpm"))
c_metrics <- sm(graph_metrics(norm_test)) ## c for cpm!
## low count filtering has some effect, too
norm_test <- sm(normalize_expt(expt, filter="pofa"))
f_metrics <- sm(graph_metrics(norm_test)) ## f for filter!
## how about if we mix and match methods?
norm_test <- sm(normalize_expt(expt, transform="log2", convert="cpm",
norm="quant", batch="combat_scale", filter=TRUE,
batch_step=4, low_to_zero=TRUE))
## Some metrics are not very useful on (especially quantile) normalized data
norm_graphs <- sm(graph_metrics(norm_test))
Now lets see some of the resulting metrics, in this case I will just compare some pca plots, as they are good at fooling our silly visual brains into seeing patterns.
l2_metrics$pcaplot
## Also viewable with plot_pca()$plot
## PCA plots seem (to me) to prefer log2 scale data.
q_metrics$pcaplot
## only normalizing on the quantiles leaves the data open to scale effects.
c_metrics$pcaplot
## but cpm alone is insufficient
f_metrics$pcaplot
## only filtering out low-count genes is helpful as well
norm_graphs$pcaplot
## The different batch effect testing methods have a pretty widely ranging effect on the clustering
## play with them by changing the batch= parameter to:
## "limma", "sva", "svaseq", "limmaresid", "ruvg", "combat", combatmod"
knitr::kable(norm_graphs$pcares)
| propVar | cumPropVar | cond.R2 | batch.R2 |
|---|---|---|---|
| 82.42 | 82.42 | 99.36 | 0.00 |
| 6.39 | 88.81 | 86.57 | 1.80 |
| 4.02 | 92.83 | 49.64 | 1.55 |
| 2.80 | 95.63 | 8.87 | 17.53 |
| 2.11 | 97.74 | 48.94 | 6.99 |
| 1.27 | 99.01 | 6.16 | 68.83 |
| 0.99 | 100.00 | 0.45 | 3.31 |
## Thus we see a dramatic decrease in variance accounted for
## by batch after applying limma's 'removebatcheffect'
## (see batch.R2 here vs. above)
norm_graphs$smc
norm_graphs$disheat ## svaseq's batch correction seems to draw out the signal quite nicely.
## It is worth noting that the wt, early log, thy, replicate c samples are still a bit weird.
norm_graphs$tsneplot
This is a relatively small data set, so performing some differential expression analyses really should not take long at all.
When performing these analyses with hpgltools, it will attempt to perform similar analyses with limma, edgeR, and DESeq2 via the all_pairwise() function. The most likely argument is ‘model_batch’ which may be used to explicitly include/exclude a batch factor in the model, or ask it to attempt including batch factors from sva/ruv/etc. By default it will attempt to include a column from the experimental design named ‘batch’.
spyogenes_de <- sm(all_pairwise(expt))
## Even the lowest correlations are quite high.
The result of all_pairwise() is a list of the results from limma, edger, and deseq. In addition, I implemented a very simplistic, differential expression function named ‘basic()’. It also provides some simple measurements of how well the various analyses agree (ergo the black and white heatmap).
Working with these separate tables can be more than a little annoying, combine_de_tables() attempts to simplify this. It will bring together the various tables, and if asked attempt to bring them together into a pretty-ified excel workbook.
all_pairwise() arbitrarily performs all possible pairwise comparisons. This is not necessarily what one actually wishes to see. Therefore, the argument keepers takes a list of contrasts:
my_keepers <- list(
"wt_media" = c("wt_ll_cf", "wt_ll_cg"),
"mga_media" = c("mga_ll_cf", "mga_ll_cg"))
In the above example, if the keepers argument to combine_de_tables() is given as my_keepers, then the resulting table will not have the set of 6 possible comparisons, but instead will only have 2 tables named ‘wt_media’ and ‘mga_media’, which if printed to excel will be sheets named accordingly.
Because these tables can get quickly extremely large, the excel workbooks may become too big for the zip(1) program to properly merge. If that happens, one may either remove some(or all) of the plots and/or have it print csv versions of the same tables.
spyogenes_tables <- sm(combine_de_tables(spyogenes_de, excel=FALSE))
summary(spyogenes_tables)
## Length Class Mode
## data 6 -none- list
## limma_plots 6 -none- list
## edger_plots 6 -none- list
## deseq_plots 6 -none- list
## comp_plot 0 -none- list
## venns 0 -none- list
## de_summary 22 data.frame list
## Try changing the p-adjustment
spyogenes_tables <- sm(combine_de_tables(spyogenes_de, excel=FALSE, padj_type="BH"))
head(spyogenes_tables$data[[1]])
## seqnames start end width strand source type id
## M5005_Spy0001 NC_007297 202 1557 1356 + RefSeq gene M5005_Spy0001
## M5005_Spy0002 NC_007297 1712 2848 1137 + RefSeq gene M5005_Spy0002
## M5005_Spy0003 NC_007297 2923 3120 198 + RefSeq gene M5005_Spy0003
## M5005_Spy0004 NC_007297 3450 4565 1116 + RefSeq gene M5005_Spy0004
## M5005_Spy0005 NC_007297 4635 5204 570 + RefSeq gene M5005_Spy0005
## M5005_Spy0006 NC_007297 5207 8710 3504 + RefSeq gene M5005_Spy0006
## dbxref gbkey name note gene
## M5005_Spy0001 GeneID:3571011 Gene dnaA M5005_Spy0001 dnaA
## M5005_Spy0002 GeneID:3571012 Gene dnaN M5005_Spy0002 dnaN
## M5005_Spy0003 GeneID:3571013 Gene M5005_Spy_0003 M5005_Spy0003 undefined
## M5005_Spy0004 GeneID:3571014 Gene ychF M5005_Spy0004 ychF
## M5005_Spy0005 GeneID:3571015 Gene pth M5005_Spy0005 pth
## M5005_Spy0006 GeneID:3571016 Gene trcF M5005_Spy0006 trcF
## locustag parent genesynonym limma_logfc
## M5005_Spy0001 M5005_Spy_0001 character(0) character(0) 0.22320
## M5005_Spy0002 M5005_Spy_0002 character(0) character(0) 0.17990
## M5005_Spy0003 M5005_Spy_0003 character(0) character(0) -0.06607
## M5005_Spy0004 M5005_Spy_0004 character(0) character(0) 0.06423
## M5005_Spy0005 M5005_Spy_0005 character(0) character(0) 0.07449
## M5005_Spy0006 M5005_Spy_0006 character(0) character(0) -0.11200
## limma_adjp deseq_logfc deseq_adjp edger_logfc edger_adjp
## M5005_Spy0001 0.5637 0.22190 0.6396 0.23690 0.6469
## M5005_Spy0002 0.7002 0.15950 0.7992 0.17470 0.7889
## M5005_Spy0003 0.9421 -0.04509 0.9705 -0.03071 1.0000
## M5005_Spy0004 0.9447 0.05406 0.9578 0.06851 0.9715
## M5005_Spy0005 0.9169 0.05262 0.9646 0.06755 0.9980
## M5005_Spy0006 0.6719 -0.09324 0.8751 -0.07858 0.9398
## limma_ave limma_t limma_b limma_p deseq_basemean deseq_lfcse
## M5005_Spy0001 9.525 1.3760 -6.173 0.2286 8667.0 0.2008
## M5005_Spy0002 9.621 0.9257 -6.657 0.3981 9174.0 0.2241
## M5005_Spy0003 5.855 -0.3763 -6.771 0.7225 699.7 0.2303
## M5005_Spy0004 10.060 0.3657 -7.081 0.7299 12590.0 0.2149
## M5005_Spy0005 6.452 0.4288 -6.768 0.6863 1017.0 0.2201
## M5005_Spy0006 9.907 -0.9875 -6.613 0.3698 11000.0 0.1784
## deseq_stat deseq_p edger_logcpm edger_lr edger_p
## M5005_Spy0001 1.1050 0.2691 9.562 1.48300 0.2234
## M5005_Spy0002 0.7119 0.4765 9.643 0.71140 0.3990
## M5005_Spy0003 -0.1958 0.8448 5.935 0.01412 0.9054
## M5005_Spy0004 0.2516 0.8014 10.100 0.11900 0.7302
## M5005_Spy0005 0.2391 0.8110 6.473 0.07032 0.7909
## M5005_Spy0006 -0.5226 0.6013 9.904 0.18440 0.6676
## basic_nummed basic_denmed basic_numvar basic_denvar
## M5005_Spy0001 9.801 9.581 7.389e-02 9.516e-03
## M5005_Spy0002 9.774 9.596 2.584e-02 1.152e-02
## M5005_Spy0003 6.242 6.311 5.987e-02 1.568e-04
## M5005_Spy0004 10.510 10.440 5.725e-02 2.709e-02
## M5005_Spy0005 6.651 6.576 1.134e-02 4.831e-03
## M5005_Spy0006 10.020 10.140 3.987e-02 2.776e-03
## basic_logfc basic_t basic_p basic_adjp limma_adjp_bh
## M5005_Spy0001 0.21930 -1.0740 4.471e-01 7.350e-01 5.636e-01
## M5005_Spy0002 0.17810 -1.3030 3.383e-01 6.858e-01 7.002e-01
## M5005_Spy0003 -0.06930 0.4000 7.575e-01 9.025e-01 9.421e-01
## M5005_Spy0004 0.06281 -0.3059 7.918e-01 9.187e-01 9.447e-01
## M5005_Spy0005 0.07502 -0.8344 5.038e-01 7.660e-01 9.169e-01
## M5005_Spy0006 -0.11570 0.7922 5.600e-01 7.986e-01 6.719e-01
## deseq_adjp_bh edger_adjp_bh basic_adjp_bh fc_meta fc_var
## M5005_Spy0001 7.297e-01 6.468e-01 6.791e-01 0.22720 7.468e-06
## M5005_Spy0002 8.853e-01 7.890e-01 6.066e-01 0.17230 2.446e-05
## M5005_Spy0003 1.000e+00 1.000e+00 8.810e-01 -0.05478 6.498e-04
## M5005_Spy0004 1.000e+00 9.715e-01 9.002e-01 0.06218 2.212e-06
## M5005_Spy0005 1.000e+00 9.979e-01 7.182e-01 0.06376 1.499e-05
## M5005_Spy0006 9.540e-01 9.397e-01 7.574e-01 -0.09577 1.045e-03
## fc_varbymed p_meta p_var
## M5005_Spy0001 3.287e-05 2.404e-01 6.260e-04
## M5005_Spy0002 1.420e-04 4.245e-01 2.026e-03
## M5005_Spy0003 -1.186e-02 8.242e-01 8.680e-03
## M5005_Spy0004 3.557e-05 7.538e-01 1.697e-03
## M5005_Spy0005 2.351e-04 7.627e-01 4.483e-03
## M5005_Spy0006 -1.091e-02 5.462e-01 2.445e-02
Finally, extract_significant_genes() may choose ‘significant’ genes based upon a few metrics including z-score vs. the distribution of logFC; a logFC cutoff, (ajusted)p-value cutoff, and/or top/bottom n genes.
spyogenes_sig <- sm(extract_significant_genes(spyogenes_tables, excel=FALSE))
knitr::kable(head(spyogenes_sig$limma$ups[[1]]))
| seqnames | start | end | width | strand | source | type | id | dbxref | gbkey | name | note | gene | locustag | parent | genesynonym | limma_logfc | limma_adjp | deseq_logfc | deseq_adjp | edger_logfc | edger_adjp | limma_ave | limma_t | limma_b | limma_p | deseq_basemean | deseq_lfcse | deseq_stat | deseq_p | edger_logcpm | edger_lr | edger_p | basic_nummed | basic_denmed | basic_numvar | basic_denvar | basic_logfc | basic_t | basic_p | basic_adjp | limma_adjp_bh | deseq_adjp_bh | edger_adjp_bh | basic_adjp_bh | fc_meta | fc_var | fc_varbymed | p_meta | p_var | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M5005_Spy1735 | NC_007297 | 1696949 | 1698145 | 1197 | - | RefSeq | gene | M5005_Spy1735 | GeneID:3571136 | Gene | speB | M5005_Spy1735 | speB | M5005_Spy_1735 | character(0) | character(0) | 3.509 | 0.0356 | 3.691 | 0 | 3.696 | 0 | 2.346 | 13.17 | 1.865 | 1e-04 | 83.91 | 0.4335 | 8.514 | 0 | 2.925 | 101.70 | 0 | 4.349 | 1.082 | 3.503e-02 | 1.818e-02 | 3.266 | -20.02 | 3.833e-03 | 6.256e-01 | 3.555e-02 | 1.078e-14 | 6.258e-21 | 2.895e-02 | 3.710 | 1.708e-01 | 4.602e-02 | 1.778e-05 | 9.487e-10 |
| M5005_Spy1575 | NC_007297 | 1535635 | 1536831 | 1197 | - | RefSeq | gene | M5005_Spy1575 | GeneID:3571303 | Gene | norA | M5005_Spy1575 | norA | M5005_Spy_1575 | character(0) | character(0) | 2.371 | 0.0356 | 2.406 | 0 | 2.422 | 0 | 6.834 | 12.96 | 2.727 | 1e-04 | 1849.00 | 0.2276 | 10.570 | 0 | 7.338 | 87.72 | 0 | 8.250 | 5.881 | 3.683e-02 | 5.610e-02 | 2.368 | -10.99 | 9.467e-03 | 6.256e-01 | 3.555e-02 | 3.902e-23 | 4.838e-18 | 6.812e-02 | 2.507 | 8.917e-02 | 3.557e-02 | 1.921e-05 | 1.107e-09 |
| M5005_Spy1715 | NC_007297 | 1675257 | 1678751 | 3495 | - | RefSeq | gene | M5005_Spy1715 | GeneID:3571155 | Gene | scpA | M5005_Spy1715 | scpA | M5005_Spy_1715 | character(0) | character(0) | 1.780 | 0.0359 | 1.837 | 0 | 1.850 | 0 | 5.291 | 11.71 | 1.906 | 1e-04 | 1182.00 | 0.2667 | 6.888 | 0 | 6.681 | 39.32 | 0 | 3.909 | 2.155 | 3.255e-03 | 1.759e-03 | 1.754 | -35.03 | 1.314e-03 | 6.256e-01 | 3.594e-02 | 1.553e-09 | 7.695e-08 | 1.025e-02 | 1.820 | 1.114e-01 | 6.122e-02 | 3.110e-05 | 2.901e-09 |
Since most of my circos graphs are for pyogenes, it is likely that the defaults are appropriate for this particular organism.
Much(all) of the following is taken from the material in tests/testthat/test_70mga.R
microbe_ids <- as.character(sm(get_microbesonline_ids("pyogenes MGAS5005")))
mgas_df <- sm(get_microbesonline_annotation(microbe_ids[[1]])[[1]])
## Error in get_microbesonline_annotation(microbe_ids[[1]]): could not find function "get_microbesonline_annotation"
mgas_df$sysName <- gsub(pattern="Spy_", replacement="Spy", x=mgas_df$sysName)
## Error in gsub(pattern = "Spy_", replacement = "Spy", x = mgas_df$sysName): object 'mgas_df' not found
rownames(mgas_df) <- make.names(mgas_df$sysName, unique=TRUE)
## Error in make.names(mgas_df$sysName, unique = TRUE): object 'mgas_df' not found
## First make a template configuration
circos_test <- circos_prefix()
## This assumes you have a colors.conf in circos/colors/ and fonts.conf in circos/fonts/
## It also assumes you have conf/ideogram.conf, conf/ticks.conf, and conf/housekeeping.conf
## It will write circos/conf/mgas.conf with a reasonable first approximation config file.
## Creating the data directory: circos/data
## The circos directory does not exist, creating: circos/conf
## The karyotype directory does not exist, creating: circos/conf/karyotypes
## The ideogram directory does not exist, creating: circos/conf/ideograms
## Wrote karyotype to 5
## This should match the karyotype= line in mgas.conf
## Fill it in with the data for s.pyogenes
circos_kary <- circos_karyotype("mgas", length=1895017)
## Wrote karyotype to circos/conf/karyotypes/mgas.conf
## This should match the karyotype= line in mgas.conf
## Fill in the gene category annotations by gene-strand
circos_plus <- circos_plus_minus(mgas_df, circos_test)
## Error in circos_plus_minus(mgas_df, circos_test): object 'mgas_df' not found
circos_limma_hist <- circos_hist(spyogenes_de$limma$all_tables[[1]], mgas_df, circos_test, outer=circos_plus)
## Error in as.data.frame(y): object 'mgas_df' not found
circos_deseq_hist <- circos_hist(spyogenes_de$deseq$all_tables[[1]], mgas_df, circos_test, outer=circos_limma_hist)
## Error in as.data.frame(y): object 'mgas_df' not found
circos_edger_hist <- circos_hist(spyogenes_de$edger$all_tables[[1]], mgas_df, circos_test, outer=circos_deseq_hist)
## Error in as.data.frame(y): object 'mgas_df' not found
circos_suffix(cfgout=circos_test)
circos_made <- circos_make(target="mgas")
## For some reason this fails weirdly when not run interactively.
getwd()
## [1] "/cbcb/nelsayed-scratch/atb/git/hpgltools/vignettes"
circos result
GenoplotR is new to me, but it seems to work?
genoplot_chromosome()
## Warning in strings[i] <- downloaded: number of items to replace is not a
## multiple of replacement length
## Warning: closing unused connection 5 (circos/Makefile)
pander::pander(sessionInfo())
R version 3.4.1 (2017-06-30)
**Platform:** x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.utf8, LC_NUMERIC=C, LC_TIME=en_US.utf8, LC_COLLATE=en_US.utf8, LC_MONETARY=en_US.utf8, LC_MESSAGES=en_US.utf8, LC_PAPER=en_US.utf8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.utf8 and LC_IDENTIFICATION=C
attached base packages: stats, graphics, grDevices, utils, datasets and base
other attached packages: ruv(v.0.9.6) and hpgltools(v.2017.01)
loaded via a namespace (and not attached): nlme(v.3.1-131), bitops(v.1.0-6), matrixStats(v.0.52.2), devtools(v.1.13.2), bit64(v.0.9-7), doParallel(v.1.0.10), RColorBrewer(v.1.1-2), rprojroot(v.1.2), GenomeInfoDb(v.1.12.2), tools(v.3.4.1), backports(v.1.1.0), R6(v.2.2.2), rpart(v.4.1-11), KernSmooth(v.2.23-15), Hmisc(v.4.0-3), DBI(v.0.7), lazyeval(v.0.2.0), BiocGenerics(v.0.22.0), mgcv(v.1.8-17), colorspace(v.1.3-2), ade4(v.1.7-6), nnet(v.7.3-12), withr(v.1.0.2), RMySQL(v.0.10.12), gridExtra(v.2.2.1), DESeq2(v.1.16.1), bit(v.1.1-12), compiler(v.3.4.1), preprocessCore(v.1.38.1), Biobase(v.2.36.2), htmlTable(v.1.9), DelayedArray(v.0.2.7), labeling(v.0.3), checkmate(v.1.8.3), scales(v.0.4.1), DEoptimR(v.1.0-8), robustbase(v.0.92-7), genefilter(v.1.58.1), stringr(v.1.2.0), digest(v.0.6.12), foreign(v.0.8-69), rmarkdown(v.1.6), XVector(v.0.16.0), base64enc(v.0.1-3), htmltools(v.0.3.6), limma(v.3.32.4), highr(v.0.6), htmlwidgets(v.0.9), rlang(v.0.1.1), RSQLite(v.2.0), BiocInstaller(v.1.26.0), BiocParallel(v.1.10.1), gtools(v.3.5.0), acepack(v.1.4.1), RCurl(v.1.95-4.8), magrittr(v.1.5), GenomeInfoDbData(v.0.99.0), Formula(v.1.2-2), Matrix(v.1.2-10), Rcpp(v.0.12.12), munsell(v.0.4.3), S4Vectors(v.0.14.3), stringi(v.1.1.5), yaml(v.2.1.14), edgeR(v.3.18.1), SummarizedExperiment(v.1.6.3), zlibbioc(v.1.22.0), Rtsne(v.0.13), plyr(v.1.8.4), grid(v.3.4.1), blob(v.1.1.0), parallel(v.3.4.1), ggrepel(v.0.6.5), crayon(v.1.3.2), genoPlotR(v.0.8.6), methods(v.3.4.1), lattice(v.0.20-35), splines(v.3.4.1), pander(v.0.6.0), annotate(v.1.54.0), locfit(v.1.5-9.1), knitr(v.1.16), GenomicRanges(v.1.28.4), corpcor(v.1.6.9), geneplotter(v.1.54.0), reshape2(v.1.4.2), codetools(v.0.2-15), stats4(v.3.4.1), XML(v.3.98-1.9), evaluate(v.0.10.1), latticeExtra(v.0.6-28), data.table(v.1.10.4), foreach(v.1.4.3), testthat(v.1.0.2), gtable(v.0.2.0), ggplot2(v.2.2.1), openxlsx(v.4.0.17), xtable(v.1.8-2), survival(v.2.41-3), tibble(v.1.3.3), iterators(v.1.0.8), AnnotationDbi(v.1.38.2), memoise(v.1.1.0), IRanges(v.2.10.2), cluster(v.2.0.6) and sva(v.3.24.4)